
indexed nearest-by join #1

Draft
sezruby wants to merge 10 commits into main from knn-phase0

@sezruby sezruby commented May 7, 2026

Summary

Indexed approximate-nearest-neighbor join over Lance's vector indexes for Apache Spark. Adds an idiomatic DataFrame API plus a Catalyst integration that intercepts Spark 4.2's NearestByJoin operator and routes it to a per-fragment Lance probe + Spark-side merge instead of the default O(|L|×|R|) cross-product rewrite.

👉 Reviewer guide + upstream delivery plan

This branch is ~13 K LoC and not ideal for single-pass review. Before diving in:

  • REVIEWER_GUIDE.md — reading order, file map, trust-but-verify checklist. Reads in ~10 minutes. Start here.
  • UPSTREAM_DELIVERY_PLAN.md — how this branch will be split into 7 smaller PRs for lance-format/lance-spark:main. Documents redundancy to clean up and explicit out-of-scope items (benchmarks, the Spark 4.2 module, deployment-specific build config).

Most reviewers will want REVIEWER_GUIDE.md § "Start here" (3 files, 10 min) → § "Next: read the engine" (4 staged/ files, 30 min) → skim the test map. That's enough context to form an opinion on the design.

👉 For apache/spark maintainers on SPARK-56395

NEARESTBYJOIN_ANN_PROPOSAL.md frames this PoC as "one concrete implementation of the indexed path SPARK-56395 mentions as future work." It proposes no apache/spark changes and uses only the extension points that SPARK-56395 and the existing injectPostHocResolutionRule already provide. It also covers extending the approach to parquet/delta via a Lance sidecar pattern, with an honest assessment of where random-access cost limits that approach, and lists five open questions about small apache/spark additions that would help downstream implementers.

Commits

Organized as 9 feature-boundary commits matching the upstream delivery plan:

  1. feat(knn): Phase 0 foundation — LanceProbe primitive + metric types
  2. feat(knn): staged RDD pipeline + IndexedNearestJoin.apply + bounded TopKHeap
  3. feat(knn): Phase 1.5 — fragment-grouped probing for multi-task parallelism
  4. feat(knn): 3-exec Catalyst-visible staged plan with AQE-visible merge shuffle
  5. feat(knn): df.kNearestJoin DataFrame extension method
  6. feat(knn): Phase 3 hardening — refineFactor, prefilter pushdown, IVF-PQ recall
  7. feat(knn): Spark 4.2 SQL integration — IndexedNearestByJoinRule
  8. test(knn-bench): benchmark suite — synthetic, Wikipedia perf, SIFT/Cohere recall, SQL
  9. docs(knn): design, impl plan, reviewer guide, ANN proposal, benchmark results

Each commit body carries the "why" for that feature; see the log.

Headlines

Correctness: 60 tests in lance-spark-knn_2.12 + 17 tests in lance-spark-knn-4.2_2.13. Every benchmark gates through a 16-row brute-force oracle before quoting numbers. IVF-PQ recall@10 = 1.00 with refineFactor=8; SIFT1M IVF-FLAT recall@10 = 0.98 at nprobes=16, 1.00 at nprobes=64 (within noise of published FAISS numbers).

Perf (OSS Spark 3.5 cluster, 8 × 4c/16g executors on Kubernetes): Cohere wikipedia-2023-11-embed-multilingual-v3 at dim=1024, |R|=1K × |L|=50 — indexed path is 100–200× faster than Spark's cross-product baseline (7-iter median 160×; variance from multi-tenant CPU contention, order-of-magnitude is robust). Measured with write.format("noop") timing sink and oracle-gated correctness. Baseline is tight at ±2% (~65s); indexed path lands at 400–500ms.

Perf (Apple M5 Max, synthetic dim=128):

| Benchmark | Comparison | Headline (small scale, oracle-validated) |
| --- | --- | --- |
| DataFrame indexed pipeline (no index) | vs. naive Spark crossJoin + array_distance UDF + row_number window | ~551–608× (≈110,000 ms → ≈180–200 ms) |
| SQL e2e (no index) | rule ON vs. OFF (= Spark's RewriteNearestByJoin cross-product + min_by_k) | 17× (3,766 ms → 232 ms) |
| SQL e2e (IVF_FLAT, default) | rule ON + actual vector index vs. rule OFF | 27× at recall = 1.00 (3,766 ms → 138 ms) |

Full results + methodology in BENCHMARK_RESULTS.md.

Public API

// Idiomatic DataFrame API — works on Spark 3.5 / 4.0 / 4.1 / 4.2+
import org.lance.spark.knn.LanceKnnImplicits._

val docs = spark.read.format("lance").load("/path/to/lance")
val joined = queries.kNearestJoin(
  right = docs,
  leftVecCol = "qvec",
  rightVecCol = "vec",
  k = 10,
  metric = "l2",                       // l2 | cosine | dot
  probeParallelism = 4,                // Phase 1.5: fragment-grouped probing
  refineFactor = Some(8),              // Phase 3: IVF-PQ recall pass
  balanceFragments = true)             // Phase 3: skew handling

// SQL (Spark 4.2-SNAPSHOT — NearestByJoin added in SPARK-56395)
SparkSession.builder()
  .config("spark.sql.extensions", "org.lance.spark.knn.extensions.LanceKnnSparkSessionExtensions")
  .config("spark.lance.knn.indexedNearestByJoin.enabled", "true")
  .config("spark.lance.knn.nprobes", "4")              // optional recall tuning
  .config("spark.lance.knn.refineFactor", "8")         // optional recall tuning
  .getOrCreate()
// SELECT * FROM queries q INNER JOIN docs d
//   APPROX NEAREST 10 BY DISTANCE vector_l2_distance(q.vec, d.vec)

Architecture

left.analyzed
  -- LanceProbeLogicalPlan                          --> [_leftId, leftRow fields..., _refs]
  -- LanceMergeLogicalPlan                          --> same shape
  -- LanceMaterializeLogicalPlan                    --> join output schema

lowered via LanceKnnStagedStrategy to:

  LanceProbeExec
    -> ShuffleExchangeExec hashpartitioning(_leftId)   <- Catalyst inserts this via
    -> LanceMergeExec                                     EnsureRequirements from
    -> LanceMaterializeExec                               LanceMergeExec.requiredChildDistribution
                                                          = ClusteredDistribution(_leftId)

all wrapped by AdaptiveSparkPlanExec

Both the DataFrame API (IndexedNearestJoin.apply via LanceKnnDatasetBridge) and the SQL path (IndexedNearestByJoinRule in Spark 4.2) emit this same 3-plan logical tree and share LanceKnnStagedStrategy for the physical lowering.

df.explain() shows four Catalyst nodes (LanceProbe → Exchange → LanceMerge → LanceMaterialize) wrapped by AdaptiveSparkPlanExec. With AQE enabled, AQEShuffleRead coalesced appears on the merge-side shuffle after the first collection.

Key subtlety — injectPostHocResolutionRule, NOT injectOptimizerRule. Spark's built-in RewriteNearestByJoin runs in the optimizer's FinishAnalysis batch (the very first batch); rules added via injectOptimizerRule fire in operatorOptimizationBatch which runs AFTER FinishAnalysis. By the time an injected optimizer rule fires, the NearestByJoin operator has already been rewritten to a cross-product, so the SQL integration uses injectPostHocResolutionRule — the only injection point that sees the unrewritten operator. Documented in the rule's scaladoc and in DESIGN.md § "Why injectPostHocResolutionRule".

ColumnPruning subtlety. LanceMergeLogicalPlan and LanceMaterializeLogicalPlan override lazy val references = child.outputSet. Without this override, Catalyst's ColumnPruning rule inserts Project(Nil) wrappers between custom nodes when downstream consumers (count(*), Aggregate) reference no columns; ProjectExec(Nil) codegens to 0-field UnsafeRows, and ProbedLeftCodec.Decoder crashes reading ir.getLong(0) — AssertionError under interpreter, SIGSEGV under C2 JIT. See IMPL_PLAN.md "3-exec staged split — root cause and fix" for the full investigation + isolation walk. Structurally pinned by StagedPlansReferencesTest.

Cross-version

Single source compiles + tests pass against Spark 3.5 AND 4.0 on the _2.13 module. LanceKnnDatasetBridge uses a reflection-based Dataset.ofRows lookup that tries both org.apache.spark.sql.Dataset (3.x) and org.apache.spark.sql.classic.Dataset (4.0+), caching the winner per-session. Full CI matrix (3.4 / 3.5 / 4.0 / 4.1) still TODO; see IMPL_PLAN.md "Cross-version DataFrame API parity".

Test coverage

  • Oracle equivalence on both real + synthetic data at multiple dimensionalities.
  • AQE visibility (ShuffleExchangeExec hashpartitioning(_leftId) present under AdaptiveSparkPlanExec).
  • Consumer-shape regression suite (count(), agg(count("*")), select(lit(1)), collect() — the crash-shapes from the ColumnPruning investigation).
  • JIT stress (20 iterations × 10K × 100 × dim=128).
  • Recall suites for IVF-FLAT + IVF-PQ with refineFactor.
  • Real-backend SQL e2e on Spark 4.2-SNAPSHOT (oracle-gated, WHERE-pushdown round-trip).

All details in REVIEWER_GUIDE.md § "Test map" and § "Trust-but-verify checklist".

Out of scope / follow-up

Per UPSTREAM_DELIVERY_PLAN.md:

  • Cost gate replacing the current opt-in spark.lance.knn.indexedNearestByJoin.enabled flag.
  • Spark version CI matrix (compile+test verified on 3.5 and 4.0; formal CI job TODO).
  • AQE-visible shuffle for the fragment-grouped probe path (runWithFragmentGroups's internal partitionBy remains RDD-level; the merge-side shuffle IS AQE-visible).
  • Per-executor LanceProbe cache to amortize dataset open across small partitions.
  • Skew handling for the left side.
  • PySpark wrapper for df.kNearestJoin.

Documentation

  • REVIEWER_GUIDE.md — reading order, file map, "where to check first", trust-but-verify checklist, test map. For reviewers: start here.
  • UPSTREAM_DELIVERY_PLAN.md — 7-PR split strategy for delivering to lance-format/lance-spark:main, redundancy cleanup list, out-of-scope items.
  • NEARESTBYJOIN_ANN_PROPOSAL.md — SPARK-56395-specific framing for apache/spark maintainers; sidecar pattern for parquet/delta + five open questions.
  • DESIGN.md — overall feature design, architecture with the 4-node physical plan, the "why no-index Lance still beats Spark cross-product" SIMD/columnar/no-Catalyst breakdown, and benchmark validation methodology.
  • IMPL_PLAN.md — architecture sketch + phase plan + Phase 3.x backlog table + "3-exec staged split — root cause and fix" post-mortem section.
  • PHASE_PROGRESS.md — resume-without-context notes.
  • BENCHMARK_RESULTS.md — M5 Max local numbers plus OSS Spark 3.5 cluster numbers (100–200× speedup on Cohere wiki dim=1024, SIFT1M recall, sustained-load soak).

🤖 Generated with Claude Code

@sezruby sezruby changed the title WIP: indexed nearest-by join (Phase 0 — DataFrame API) indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmark (608× vs Spark crossJoin) May 7, 2026
@sezruby sezruby changed the title indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmark (608× vs Spark crossJoin) WIP: indexed nearest-by join May 7, 2026
@sezruby sezruby changed the title WIP: indexed nearest-by join indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmarks (608× DF, 18× SQL e2e) May 7, 2026
@sezruby sezruby changed the title indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmarks (608× DF, 18× SQL e2e) indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmarks (608× DF, 17× SQL, oracle-validated) May 7, 2026
@sezruby sezruby changed the title indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmarks (608× DF, 17× SQL, oracle-validated) indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmarks (608× DF, 27× SQL @ recall 1.0) May 7, 2026
@sezruby sezruby changed the title indexed nearest-by join — Phase 0/1/1.5/2/3 + benchmarks (608× DF, 27× SQL @ recall 1.0) indexed nearest-by join — Phase 0/1/1.5/2/3.x + benchmarks (608× DF, 27× SQL @ recall 1.0, AQE-visible) May 8, 2026
@sezruby sezruby changed the title indexed nearest-by join — Phase 0/1/1.5/2/3.x + benchmarks (608× DF, 27× SQL @ recall 1.0, AQE-visible) indexed nearest-by join — Phase 0/1/1.5/2/3.x + benchmarks (608× DF, 27× SQL @ recall 1.0) May 8, 2026
@sezruby sezruby changed the title indexed nearest-by join — Phase 0/1/1.5/2/3.x + benchmarks (608× DF, 27× SQL @ recall 1.0) indexed nearest-by join May 11, 2026
sezruby and others added 3 commits May 12, 2026 09:59
Introduce the per-task vector-search primitive and its supporting types.
`LanceProbe` opens a Lance dataset once and drives `nearest()` + row-id-based
materialize calls against it. Unit-tested with recall=1.0 against brute-force
oracles on both uniform-random and clustered-embedding fixtures.

New artifacts:
  - `LanceProbe` — open-once, probe-many wrapper around Lance's Java API.
  - `Metric` — L2 / Cosine / Dot enum, `smallerIsBetter` flag threaded
    through to ordering logic.
  - `ScoredRowRef` — (rowId, score) pair crossing the inter-stage boundary.
  - `LanceProbeValidationTest` + `LanceVectorIndexBuilder` test helper.
  - New `lance-spark-knn_2.12` / `_2.13` modules in the reactor.

No functional impact on existing modules; `lance-spark-knn` is an additive
module reachable only through its own API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opKHeap

Build the three-stage kNN-join pipeline on top of LanceProbe:

  - LanceProbeStage — per-task nearest-search emitting (leftId, ProbedLeft).
  - LanceMergeStage — per-partition bounded-heap merge of contributions
    per leftId, trimming to finalK.
  - LanceMaterializeStage — point-fetch right rows by _rowid, assemble
    final join Rows.

Plus the TopKHeap primitive (metric-aware bounded heap for the merge-side
aggregation) and the public entry point `IndexedNearestJoin.apply(left,
rightLanceUri, leftVecCol, rightVecCol, k, metric, scoreCol)`.

End-to-end tested: `IndexedNearestJoinCorrectnessTest` verifies recall=1.0
vs. an in-memory brute-force oracle at 1K × 100 × dim=16. `IndexedNearestJoinTest`
covers the public-API surface (left-outer join, custom score column,
projection list, refine factor).
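
To make the merge-side primitive concrete, here is a minimal sketch of a metric-aware bounded heap. It is not the PR's `TopKHeap`: the names `TopKSketch`, `ScoredRef`, and `BoundedTopK` are hypothetical, and the only behavior taken from the text above is "keep the best K candidates, with direction decided by a `smallerIsBetter` flag".

```scala
import scala.collection.mutable

object TopKSketch {
  // Hypothetical stand-in for the (rowId, score) pair that crosses stages.
  final case class ScoredRef(rowId: Long, score: Float)

  final class BoundedTopK(k: Int, smallerIsBetter: Boolean) {
    require(k > 0, "k must be positive")

    private val byScore: Ordering[ScoredRef] = new Ordering[ScoredRef] {
      def compare(a: ScoredRef, b: ScoredRef): Int =
        java.lang.Float.compare(a.score, b.score)
    }

    // PriorityQueue keeps its maximum at the head, so order candidates by
    // "badness": the head is then always the worst retained candidate.
    private val worstFirst: Ordering[ScoredRef] =
      if (smallerIsBetter) byScore else byScore.reverse

    private val heap = mutable.PriorityQueue.empty[ScoredRef](worstFirst)

    // Keep the candidate only if there is room or it beats the current worst.
    def offer(c: ScoredRef): Unit =
      if (heap.size < k) heap.enqueue(c)
      else if (worstFirst.lt(c, heap.head)) {
        heap.dequeue()
        heap.enqueue(c)
      }

    // Retained candidates, best first.
    def result: List[ScoredRef] = heap.toList.sorted(worstFirst)
  }
}
```

Each `offer` costs O(log K) and memory stays at K entries per left row, which is the property a per-leftId merge needs when many probe tasks contribute candidates.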

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elism

Opt-in `probeParallelism: Int = 1` parameter on `IndexedNearestJoin.apply`.
When set > 1, the driver enumerates Lance fragments via `Dataset.getFragments()`,
groups them (round-robin or LPT bin-packing when `balanceFragmentsByRowCount
= true`), and replicates each left row across the groups so N parallel tasks
each probe a disjoint fragment subset. Downstream merge aggregates the N
contributions per leftId.
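
The two grouping strategies named above can be sketched in a few lines. This is an illustration only, assuming a hypothetical `Frag(id, numRows)` record in place of Lance's fragment metadata; `roundRobin`, `lpt`, and `maxLoad` are made-up names, not the PR's functions.

```scala
object FragmentGrouping {
  // Hypothetical stand-in for a Lance fragment's id and row count.
  final case class Frag(id: Int, numRows: Long)

  // Round-robin: fragment i goes to group i % g, ignoring its size.
  def roundRobin(frags: Seq[Frag], g: Int): Vector[Vector[Frag]] = {
    val groups = Array.fill(g)(Vector.empty[Frag])
    frags.zipWithIndex.foreach { case (f, i) => groups(i % g) = groups(i % g) :+ f }
    groups.toVector
  }

  // LPT: visit fragments largest-first, always adding to the lightest group.
  def lpt(frags: Seq[Frag], g: Int): Vector[Vector[Frag]] = {
    val groups = Array.fill(g)(Vector.empty[Frag])
    val load = Array.fill(g)(0L)
    frags.sortBy(f => -f.numRows).foreach { f =>
      val i = load.indices.minBy(j => load(j))
      groups(i) = groups(i) :+ f
      load(i) += f.numRows
    }
    groups.toVector
  }

  // Rows handled by the busiest group, i.e. the slowest probe task's share.
  def maxLoad(groups: Seq[Seq[Frag]]): Long = groups.map(_.map(_.numRows).sum).max
}
```

With row counts 100, 10, 10, 10, 90, 10 and two groups, round-robin puts both large fragments in the same group (200 rows vs. 30), while LPT ends at 120 vs. 110. That gap is the skew the balancing flag is meant to remove.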

The bandwidth win the staged design promises only lands here — Phase 0/1
had the shape but a single contributor per leftId (degenerate merge). Phase
1.5 makes the merge stage do real work across tasks.

Edge case: when `probeParallelism > numFragments`, only one group has
fragments and the rule degenerates gracefully back to the single-task path,
avoiding a replicate shuffle for nothing.

Oracle-verified for G=4 and G=8 (with and without skew-balanced grouping)
against brute force.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sezruby and others added 6 commits May 12, 2026 11:31
… shuffle

Surface the staged pipeline as explicit Spark operators so `df.explain()`
shows the shape and Catalyst/AQE can engage on the merge shuffle:

  LanceProbeExec
    -> ShuffleExchangeExec hashpartitioning(_leftId)   <- Catalyst inserts this
    -> LanceMergeExec                                  <- via EnsureRequirements
    -> LanceMaterializeExec                            <- from
                                                          requiredChildDistribution
                                                          = ClusteredDistribution(_leftId)
                                                          on LanceMergeExec

Wrapped by AdaptiveSparkPlanExec. With AQE on, `CoalesceShufflePartitions`
/ `OptimizeSkewJoin` / `OptimizeShuffleWithLocalRead` all engage on the
merge shuffle (visibly `AQEShuffleRead coalesced` in the executed plan).

ColumnPruning subtlety: `LanceMergeLogicalPlan` and
`LanceMaterializeLogicalPlan` override `lazy val references =
child.outputSet`. Without this override, Catalyst's ColumnPruning rule
inserts `Project(Nil)` between custom nodes when downstream consumers
reference no columns (count(*), agg, select(lit(1))); `ProjectExec(Nil)`
codegens to 0-field UnsafeRows which crash `ProbedLeftCodec.Decoder` at
`ir.getLong(0)` — AssertionError under interpreter, SIGSEGV under C2 JIT.
The override makes the custom nodes declare all child outputs load-bearing,
short-circuiting ColumnPruning's subset guard.

Inter-stage row format: `ProbedLeftCodec` uses a flat schema (leftId + left
columns inlined + refs array-of-struct) rather than nested struct — earlier
multi-pass / nested-struct codec attempts had binary-layout issues at
benchmark scale.

`LanceKnnDatasetBridge` in `org.apache.spark.sql` is a trampoline to the
package-private `Dataset.ofRows`, locating it via reflection: Spark 3.x
exposes it on `org.apache.spark.sql.Dataset`; Spark 4.0 moved the concrete
implementation to `org.apache.spark.sql.classic.Dataset`. The bridge tries
both at startup and caches the winner, so the knn module compiles + runs
against Spark 3.5, 4.0, 4.1, and 4.2-SNAPSHOT from a single source.
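
The "try both, cache the winner" lookup can be sketched without Spark on the classpath. This is a simplified illustration, not the bridge's code: `ClassLookupSketch` and `firstLoadable` are hypothetical names, the candidate order is an assumption, and the sketch stops at resolving the class (the companion-object and `ofRows` method lookup are omitted).

```scala
import scala.util.Try

object ClassLookupSketch {
  // Returns the first candidate class that loads; None if none are present.
  def firstLoadable(candidates: Seq[String]): Option[Class[_]] =
    candidates.iterator
      .map(name => Try[Class[_]](Class.forName(name)).toOption)
      .collectFirst { case Some(cls) => cls }

  // Resolved once on first use and cached for the lifetime of the object.
  // Assumption: the 4.0+ location is tried first, because the 3.x class name
  // also exists on 4.0+.
  lazy val datasetClass: Option[Class[_]] = firstLoadable(Seq(
    "org.apache.spark.sql.classic.Dataset",
    "org.apache.spark.sql.Dataset"))
}
```

Because the class names are plain strings, nothing here references a version-specific symbol at compile time, which is what lets one source tree build against several Spark versions.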

Five test suites pin the behavior: AQE visibility, plan shape, consumer
shape (the crash-shapes from the ColumnPruning investigation), JIT stress,
structural pin on the references override, plus the two isolation tests
from the post-mortem kept as regression coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-facing extension over `DataFrame` that mirrors `df.join(other, ...)`
and wraps `IndexedNearestJoin.apply` with right-side URI auto-extraction
from the analyzed plan.

  import org.lance.spark.knn.LanceKnnImplicits._
  leftDf.kNearestJoin(rightDf, leftVecCol = "v", rightVecCol = "v", k = 10)

Non-Lance right sides (parquet, in-memory, alias-wrapped non-Lance) fail
fast with IllegalArgumentException naming the constraint. Works on Spark
3.5 / 4.0 / 4.1 / 4.2+ via the reflection-based Dataset.ofRows lookup in
LanceKnnDatasetBridge (introduced in the preceding 3-exec commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PQ recall

Four substantive additions:

1. refineFactor / ef parameters on IndexedNearestJoin.apply, plumbed through
   LanceProbeStage.Conf to Query.Builder (setRefineFactor / setEf). IVF-PQ
   recall knob (fetches k*refineFactor PQ candidates, re-ranks with exact
   distance) and HNSW search-depth knob respectively. Defaults preserve
   current behavior.

2. balanceFragmentsByRowCount flag: LPT greedy bin-packing (a 4/3-approximation
   of the optimal makespan) on FragmentMetadata.getNumRows, used instead of
   round-robin when the fragment-row-count distribution is skewed.

3. Prefilter pushdown into the base module. Extends LanceFragmentScanner to
   carry a user-supplied SQL filter string, and LanceSparkReadOptions to
   serialize it. IndexedNearestJoin uses this to push right-side WHERE
   clauses into Lance's index-lookup path (prefilter = true is always set),
   so top-K is computed over only matching rows — correctness, not just
   perf: without prefilter, an indexed probe could return K rows all later
   filtered out, masking truly-nearest-but-also-matching rows further down
   the index.

4. Switched the whole pipeline from _rowaddr to _rowid. Lance's indexed
   nearest-search materializes _rowid but not _rowaddr; using _rowid on
   both probe + materialize paths makes it work for indexed AND non-indexed
   scans uniformly.
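
The correctness argument in item 3 is easy to see on a toy dataset. The sketch below uses plain Scala collections rather than Lance; `Doc`, `postFilterTopK`, and `preFilterTopK` are hypothetical names, and the distances are made-up illustrative values.

```scala
object PrefilterSketch {
  // Hypothetical right-side row: id, distance to the query, and a filter column.
  final case class Doc(id: Long, dist: Double, lang: String)

  // Post-filter: take the K nearest first, then apply the predicate.
  def postFilterTopK(docs: Seq[Doc], k: Int)(p: Doc => Boolean): Seq[Doc] =
    docs.sortWith(_.dist < _.dist).take(k).filter(p)

  // Prefilter: apply the predicate first, then take the K nearest survivors.
  def preFilterTopK(docs: Seq[Doc], k: Int)(p: Doc => Boolean): Seq[Doc] =
    docs.filter(p).sortWith(_.dist < _.dist).take(k)

  val docs: Seq[Doc] = Seq(
    Doc(1L, 0.1, "fr"), Doc(2L, 0.2, "fr"),
    Doc(3L, 0.3, "en"), Doc(4L, 0.4, "en"))
}
```

For `k = 2` and the predicate `lang == "en"`, the post-filter variant returns nothing because both of the nearest rows are French, while the prefilter variant returns rows 3 and 4. This is the masking effect the commit describes.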

IndexedNearestJoinIvfPqRecallTest builds a real IVF-PQ index via Lance's
Java API and measures recall@K: 0.73 at defaults, 1.00 with refineFactor=8
(exact-distance re-rank recovers all true neighbors).
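
For readers unfamiliar with the metric, recall@K is the fraction of the true K nearest neighbors that the approximate search returned. A minimal sketch, with the hypothetical name `recallAtK` and no claim to match the test's own helper:

```scala
object RecallSketch {
  // approx: row ids from the indexed search; exact: row ids from brute force.
  // Both are expected best-first; only the first k of each are compared.
  def recallAtK(approx: Seq[Long], exact: Seq[Long], k: Int): Double = {
    val truth = exact.take(k).toSet
    if (truth.isEmpty) 1.0
    else approx.take(k).count(truth.contains).toDouble / truth.size
  }
}
```

Recall is order-insensitive within the top K: a result list that contains all true neighbors in a different order still scores 1.0.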

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New module `lance-spark-knn-4.2_2.13` adds a Catalyst postHocResolutionRule
that intercepts Spark 4.2's NearestByJoin (SPARK-56395) over a Lance scan
and emits the same 3-plan staged logical tree the DataFrame API path
builds. Shared `LanceKnnStagedStrategy` lowers both paths to the identical
LanceProbeExec -> ShuffleExchangeExec -> LanceMergeExec ->
LanceMaterializeExec chain.

Subtle: the rule MUST use `injectPostHocResolutionRule`, not
`injectOptimizerRule`. Spark's built-in RewriteNearestByJoin runs in the
optimizer's FinishAnalysis batch (the very first batch); rules added via
injectOptimizerRule fire in operatorOptimizationBatch, which runs AFTER
FinishAnalysis. By the time an injected optimizer rule fires, the
NearestByJoin operator has already been rewritten to a cross-product +
MaxMinByK plan — nothing left to pattern-match.

Pattern match recognizes the three SPARK-56395 ranking expressions
(VectorL2Distance + NearestByDistance, VectorCosineSimilarity +
NearestBySimilarity, VectorInnerProduct + NearestBySimilarity) over a
Lance DSv2 relation. Direction must match expression's natural ordering.
Rule is opt-in via `spark.lance.knn.indexedNearestByJoin.enabled`
(default false) until a cost-based gate lands in Phase 3.x.

Prefilter pushdown: unwraps Filter(cond, lance) and Project(<passthrough>,
Filter(...)), translates the predicate to Lance SQL (binary comparisons,
IN, IS [NOT] NULL, AND/OR/NOT over right-side attrs vs literals). Anything
else makes the rule REFUSE the rewrite (no partial pushdown — dropping a
residual conjunct would silently change query semantics).
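
The all-or-nothing translation policy can be illustrated with a small predicate ADT. This is a sketch under stated assumptions: `Pred` and its cases are hypothetical stand-ins for Catalyst expressions, `translate` is a made-up name, IN is omitted for brevity, and the emitted strings are ordinary SQL rather than verified Lance filter syntax.

```scala
object PrefilterTranslation {
  sealed trait Pred
  final case class Cmp(col: String, op: String, lit: Any) extends Pred
  final case class IsNull(col: String, negated: Boolean) extends Pred
  final case class And(l: Pred, r: Pred) extends Pred
  final case class Or(l: Pred, r: Pred) extends Pred
  final case class Not(p: Pred) extends Pred
  // Anything the translator does not understand, e.g. a UDF call.
  final case class Opaque(desc: String) extends Pred

  private val ops = Set("=", "<>", "<", "<=", ">", ">=")

  private def literal(v: Any): Option[String] = v match {
    case s: String => Some("'" + s.replace("'", "''") + "'")
    case n: Int    => Some(n.toString)
    case n: Long   => Some(n.toString)
    case _         => None
  }

  // None means "refuse": one untranslatable leaf poisons the whole predicate,
  // so a conjunct is never silently dropped.
  def translate(p: Pred): Option[String] = p match {
    case Cmp(c, op, lit) if ops(op) => literal(lit).map(v => s"$c $op $v")
    case IsNull(c, neg) => Some(if (neg) s"$c IS NOT NULL" else s"$c IS NULL")
    case And(l, r) => for (a <- translate(l); b <- translate(r)) yield s"($a AND $b)"
    case Or(l, r)  => for (a <- translate(l); b <- translate(r)) yield s"($a OR $b)"
    case Not(q)    => translate(q).map(s => s"(NOT $s)")
    case _         => None
  }
}
```

Returning `Option[String]` and combining children with a for-comprehension means a caller can only see a full translation or `None`, so the rule can skip the rewrite and let Spark's default plan evaluate the original filter.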

Tests: IndexedNearestByJoinRuleTest covers the pattern-match positive +
negative cases and pins the emitted 3-plan tree shape.
IndexedNearestByJoinE2ETest drives a real Lance dataset end-to-end on
Spark 4.2-SNAPSHOT, asserts all three execs + the Catalyst-inserted
hashpartitioning(_leftId) exchange are in the physical plan, and matches
top-K results against an in-memory brute-force oracle at dim=16 +
dim=1024. Rule-off falls through to Spark's RewriteNearestByJoin and
still matches the oracle — proves the opt-in gate doesn't break
correctness.

Schema note: `NearestByJoin.output` widens every left+right attribute to
nullable=true (matching what Spark's default rewrite does via the First
aggregate). The rule widens the materialize stage's internal finalSchema
to match, keeping the ExpressionEncoder layout consistent with the
declared output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…here recall, SQL

Seven benchmarks validate correctness + perf + scaling on real + synthetic
data. All use `write.format("noop")` as the timing sink (Spark's canonical
benchmark pattern — materializes every row without a driver round-trip) and
gate correctness through a 16-row brute-force oracle before timed runs.

  - IndexedNearestJoinBenchmark  -- synthetic random, dim=128, 5 configs
                                    (crossJoin baseline + 4 indexed variants)
  - WikipediaKnnPerfBenchmark    -- Cohere wikipedia-2023-11-embed-multilingual
                                    parquet shards, dim=1024, real embeddings
  - SiftRecallBenchmark          -- canonical SIFT1M corpus, IVF-FLAT recall@10
  - CohereWikiRecallBenchmark    -- IVF-FLAT recall on Cohere wiki, dim=1024
  - IndexedNearestJoinSoakTest   -- concurrent sustained load (10-min smoke
                                    window, 492 queries, driver heap
                                    stability check)
  - IndexedNearestByJoinSqlBenchmark (in lance-spark-knn-4.2_2.13)
                                 -- SQL-path counterpart of the synthetic
                                    benchmark; measures rule ON vs OFF
  - InterStagePayloadOverheadBench (test-scope microbench)
                                 -- encode-decode overhead of ProbedLeftCodec
                                    at realistic row widths, measured <1% of
                                    total wall-clock at every SQL benchmark
                                    scale

Validation on a real OSS Spark 3.5 cluster (8 × 4c/16g executors, Kubernetes):
Cohere wiki dim=1024, |R|=1K × |L|=50 — indexed path is 100-200x faster
than crossJoin (7-iter median 160x; variance from multi-tenant CPU
contention, order-of-magnitude robust). SIFT1M IVF-FLAT recall@10 = 0.98 at
nprobes=16, 1.00 at nprobes=64 — within noise of published FAISS numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… results

Seven reviewer-facing docs live next to the lance-spark-knn_2.12 sources:

  - DESIGN.md                     -- end-to-end architecture, why no-index
                                     Lance beats Spark cross-product
                                     (SIMD / columnar / no-Catalyst
                                     breakdown), Phase 0-3.x evolution.
  - IMPL_PLAN.md                  -- original architecture sketch, phase
                                     plan, Phase 3.x backlog, the 3-exec
                                     staged split post-mortem (ColumnPruning
                                     -> Project(Nil) -> 0-field UnsafeRow
                                     -> SIGSEGV and how the references
                                     override fixed it).
  - PHASE_PROGRESS.md             -- resume-without-context notes for
                                     new-session reviewers.
  - REVIEWER_GUIDE.md             -- ~10-min reading order + file map +
                                     test map + trust-but-verify checklist.
                                     Start-here doc.
  - UPSTREAM_DELIVERY_PLAN.md     -- 7-PR split strategy for delivering
                                     the feature to lance-format/lance-spark,
                                     redundancy audit, explicit out-of-scope
                                     items.
  - BENCHMARK_RESULTS.md          -- local M5 Max numbers + OSS Spark 3.5
                                     cluster numbers with variance
                                     envelope, per-iteration tables, and
                                     reproduction instructions.
  - NEARESTBYJOIN_ANN_PROPOSAL.md -- standalone proposal doc for sharing
                                     with apache/spark maintainers on
                                     SPARK-56395. Frames the PoC as "one
                                     concrete implementation of the
                                     indexed-path follow-up" with Lance
                                     sidecar extension for parquet/delta
                                     and five open questions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sezruby sezruby force-pushed the knn-phase0 branch 2 times, most recently from 1d6ec2c to f875e88 May 13, 2026 17:27
…aling sweep

Two changes to land alongside the cluster scaling runs:

1. Replace `row_number window` headline baseline with `groupBy + sort_array(K)` (config A2).
   The previous A baseline applies a `row_number().over(Window.partitionBy(lid))` over
   the full cross-product; that requires a global per-lid sort with no partial
   aggregation, runs hours at medium scale, and isn't what Spark 4.2's
   RewriteNearestByJoin actually produces. A2 uses
   `groupBy(lid).agg(slice(sort_array(collect_list(struct(dist, rid))), 1, K))` —
   the closest Spark 3.5 SQL expression of 4.2's `min_by(struct, expr, K)`
   (`MaxMinByK`, SPARK-55322). Spark applies partial aggregation per task so the
   shuffle volume stays bounded. A is preserved as opt-in via
   `BENCHMARK_INCLUDE_BASELINE_A=true`.

2. Add baseline-sweep + medium_l100 ground-truth scales for cross-cluster
   scaling characterization. Sample |R|={10K,50K,100K,200K} at fixed |L|=1000,
   plus one |R|=1M, |L|=100 ground truth (10x reduced |L|). Cross-product cost is
   linear in both |L| and |R|, so this combination lets us extrapolate full medium
   (|R|=1M, |L|=1K = 1B pairs) cheaply (~30 min cluster total) while validating the
   linearity assumption against an independent ground-truth measurement.

Two cluster knobs surfaced from the runs:
  - BENCH_DISABLE_AQE=true: AQE's CoalesceShufflePartitions throttles parallelism
    on small post-shuffle data (collapses 128 partitions to ~8), capping the
    cross-join compute stage at 8 parallel tasks regardless of cluster cores.
    Off for baseline runs; indexed runs benefit from AQE on the merge shuffle.
  - BENCH_BASELINE_RIGHT_PARTITIONS=N: repartition right side post-Lance-read so
    the fused cross-join compute stage gets enough tasks to use all cores.
    Default 64; matches an 8x8c cluster.

Doc update: BENCHMARK_RESULTS.md now has a "Synthetic benchmark" section with
the full cross-cluster sweep, big-vs-small comparison, and an honest variance
disclosure (multi-tenant ~20% noise envelope; noisy-neighbor pods that can
make one executor 2-3x slower across a whole run; executor-death retry
inflation). Includes setup instructions for first-time reviewers and
methodology callouts (oracle gating, noop sink, AQE rationale, A2 vs 4.2-native).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>